Chris Pollett > Old Classes > CS267
( Print View )

Student Corner:
  [Grades Sec2]

  [Submit Sec2]

  [Class Sign Up Sec2]

  [Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Topics/Outcomes]
  [Outcomes Matrix]
  [Grading]
  [HW/Quiz Info]
  [Exam Info]
  [Regrades]
  [Honesty]
  [Additional Policies]
  [Announcements]

HW Assignments:
  [Hw1] [Hw2] [Hw3]
  [Hw4] [Quizzes]

Practice Exams:
  [Mid 1] [Mid 2] [Final]

HW#3 --- last modified February 07 2019 15:03:21..

Solution set.

Due date: Nov 13

Files to be submitted:
Hw3.zip

Purpose: Learn how to analyze some of our inverted index algorithms. Code BM25. Use trec_eval software.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from.

LO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

LO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: A set of book problems, which are to be done individually, should be handwritten (no typing), and must be turned in at the start of class on the day indicated; a coding assignment; and an evaluation task. The latter two parts of this assignment can be done in a team of at most three people. For this portion only one team member needs to submit -- just be sure to mention the names of your team mates in the documentation for your code.

The problems I want you to do are: Exercise 4.1 and 4.2 (due Oct. 10). I know, you might have done 4.1 for Oct. 3 even though we hadn't covered the material yet -- in which case, just turn in the answer again. Next, assume `k_1` and `b` are 1.2 and 0.75 respectively. Suppose we have a corpus of 25 million documents. The word "big" appears in 1 million, "mac" appears in 25 thousand, and "lots" appears in 10 thousand. The average document is 300 words long. Document 27 contains, the word "big" eight times, "mac" three times and "lots" once. It is 700 words long. Calculate the BM25 score for Document 27 for the queries "big lots" and "big mac". Then do Exercise 5.1 (due Oct 17). Exercise 5.2 and 5.4. (due Oct 24). Exercise 6.1. Then figure out what would happen if you tried to do arithmetic coding for the symbols in problem 6.1 (Oct. 31)

For the coding portion of the assignment I want you to extend the program you wrote for Hw2 so that now it takes as a possible third arguments proximity and bm25. You can either start with your own code or extend the Hw2 solution. If proximity is selected as the ranking method, then it should calculate the proximity scores of all the documents which contain the query terms and output the results in decreasing order of proximity relevance. If bm25 is selected as the ranking method, then it should calculate the bm25 scores of all the documents which contain the query terms and output the results in decreasing order of BM25 relevance. The output format should be modified from Hw2 so that it corresponds to a trec_eval results file.

For the last part of the homework I want you to download and compile the trec_eval program from NIST. I want you to create a text_qrels_file.txt by hand based for your corpus and topics of HW2. Then I want you to carry out some experiments to compare using trec_eval the results returned by cosine rank versus those returned by BM25 and proximity ranking. Your zip file should contain all the text files you generated for your experiments and it should contain Experiments.txt which explains what you did and summarizes the results you got. It should also contain actual trec_eval output.

Point Breakdown

Book problems (1pt each graded on scale 0, 1/2 partial, 1 completely correct)	4pts
Proximity extension of HW2 code works as describes	2pts
BM25 extension of HW2 code works as describes	2pts
File such as qrel and results files are contained in the folder and are in the correct format.	1pt
Experiments.txt shows that a reasonable trec_eval experiment was conducted.	1pt
Total	10pts